<h1 align="center">Mani-WM: An Interactive World Model for Real-Robot Manipulation</h1>

Scalable robot learning in the real world is limited by the cost and safety issues
of real robots. In addition, rolling out robot trajectories in the real world can
be time-consuming and labor-intensive. In this paper, we propose to learn an
interactive world model for robot manipulation as an alternative. We present a
novel method, Mani-WM, which leverages the power of generative models to
generate realistic videos of a robot arm executing a given action trajectory, starting
from a given initial frame. Mani-WM employs a novel frame-level conditioning
technique to ensure precise alignment between actions and video frames and
leverages a diffusion transformer for high-quality video generation. To validate the
effectiveness of Mani-WM, we perform extensive experiments on four challenging
real-robot datasets. Results show that Mani-WM outperforms all compared
baseline methods and is preferred in human evaluations. We further showcase
the flexible action controllability of Mani-WM by controlling the virtual robots in
datasets with trajectories 1) predicted by an autonomous policy and 2) collected by
a keyboard or VR controller. Finally, we combine Mani-WM with model-based
planning to showcase its usefulness on real-robot manipulation tasks. We hope that
Mani-WM can serve as an effective and scalable approach to enhance robot learning
in the real world.

<img src="assets/images/intro.png" alt="introduction" width="100%"/>

## Installation

To set up the environment, run the following command:
```bash
bash scripts/install.sh
```

## Dataset

The complete dataset structure can be found in [dataset_structure.txt]().
For anonymity reasons, we are unable to provide download links for the dataset and checkpoints. We will make them publicly available after the paper is accepted.

## Language Table Application

We recommend starting with the Language Table application, which provides a keyboard interface for moving the robot arm shown in an initial image on a 2D plane:

```bash
python3 application/languagetable.py
```
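
As a rough illustration of how keyboard input can drive a 2D action trajectory, the sketch below maps key presses to per-frame displacements. The key bindings, the step size, and the names `KEY_TO_DELTA` and `keys_to_actions` are assumptions made for this example and may differ from what `application/languagetable.py` actually does.

```python
# Hypothetical key bindings: each key is a unit direction (dx, dy) on the 2D plane.
KEY_TO_DELTA = {
    "w": (0.0, 1.0),
    "s": (0.0, -1.0),
    "a": (-1.0, 0.0),
    "d": (1.0, 0.0),
}


def keys_to_actions(keys, step=0.01):
    """Convert a sequence of key presses into per-frame 2D actions (dx, dy)."""
    actions = []
    for key in keys:
        # Unbound keys are treated as "no motion" for that frame.
        dx, dy = KEY_TO_DELTA.get(key, (0.0, 0.0))
        actions.append((dx * step, dy * step))
    return actions


if __name__ == "__main__":
    print(keys_to_actions(["w", "w", "d", "x"]))
```

Keeping one action per key press gives one action per generated frame, so the sequence can be passed to the world model as an action trajectory.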

## Training

Below are example scripts for training the WM-Frame-Ada model on the RT-1 dataset.

To accelerate training, we recommend first encoding the videos into latent videos. Training directly on the raw videos is also supported by setting `pre_encode` to `false`.
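
The toy sketch below shows why pre-encoding can save time: with pre-encoding, the encoder runs once per video, while without it the encoder runs on every access. The names `make_latent_getter`, `count_encoder_calls`, and `toy_encode` are hypothetical, the `pre_encode` argument only mirrors the config option named above, and the in-memory cache is an assumption (the actual pipeline may store latents differently).

```python
def make_latent_getter(videos, encode, pre_encode=True):
    """Return a function mapping a dataset index to a latent video."""
    if pre_encode:
        # Encode every video once, before training starts.
        latents = [encode(video) for video in videos]
        return lambda index: latents[index]
    # Otherwise encode on every access, i.e. once per sample per epoch.
    return lambda index: encode(videos[index])


def count_encoder_calls(pre_encode, num_videos=3, num_epochs=5):
    """Count how often a toy encoder runs over a small training loop."""
    calls = 0

    def toy_encode(video):
        nonlocal calls
        calls += 1
        return [pixel / 255.0 for pixel in video]  # placeholder for a real encoder

    videos = [[0, 128, 255] for _ in range(num_videos)]
    get_latent = make_latent_getter(videos, toy_encode, pre_encode)
    for _ in range(num_epochs):
        for index in range(num_videos):
            get_latent(index)
    return calls


if __name__ == "__main__":
    print(count_encoder_calls(pre_encode=True), count_encoder_calls(pre_encode=False))
```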

### Single GPU Training
```bash
python3 main.py --config configs/train/rt1/frame_ada.yaml
```

### Multi-GPU Training on a Single Machine
```bash
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --rdzv_endpoint {node_address}:{port} --rdzv_id 107 --rdzv_backend c10d main.py --config configs/train/rt1/frame_ada.yaml
```
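
In the command above, `{node_address}` and `{port}` are placeholders to fill in before running. The helper below is a small sketch that assembles the same command from concrete values. The function name `build_torchrun_command` and the defaults `localhost` and `29500` are assumptions for a single-machine run, not values taken from this repository.

```python
import shlex


def build_torchrun_command(config, node_address="localhost", port=29500, nproc_per_node=8):
    """Assemble the single-machine torchrun command as a shell-safe string."""
    return shlex.join([
        "torchrun",
        "--nproc_per_node", str(nproc_per_node),  # one process per GPU
        "--nnodes", "1",
        "--node_rank", "0",
        "--rdzv_endpoint", f"{node_address}:{port}",  # assumed values, adjust as needed
        "--rdzv_id", "107",
        "--rdzv_backend", "c10d",
        "main.py", "--config", config,
    ])


if __name__ == "__main__":
    print(build_torchrun_command("configs/train/rt1/frame_ada.yaml"))
```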

## Evaluation

Below are example scripts for evaluating the WM-Frame-Ada model on the RT-1 dataset.

### Short Trajectory Setting

To quantitatively evaluate the model in the short trajectory setting, first generate all evaluation videos:
```bash
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --rdzv_endpoint {node_address}:{port} --rdzv_id 107 --rdzv_backend c10d main.py --config configs/evaluation/rt1/frame_ada.yaml
```

We provide an automated script to calculate the metrics of the generated short videos:
```bash
python3 evaluate/evaluation_short_script.py
```

### Long Trajectory Setting

Long videos are generated in an autoregressive manner.

Generate the scripts that produce the long videos using multiple processes:
```bash
python3 scripts/generate_command.py
```

Run:
```bash
bash scripts/generate_long_video_rt1_frame_ada.sh
```

Use the automated script to calculate the metrics of the generated long videos:
```bash
python3 evaluate/evaluation_long_script.py
```
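
For readers unfamiliar with autoregressive generation, the sketch below shows the general idea: the action trajectory is split into chunks, and each chunk is generated from the most recent frame, which is itself a generated frame after the first chunk. The names `rollout_long_video`, `generate_chunk`, and `toy_model` are hypothetical, and conditioning on only the last frame is an assumption. The generated scripts may condition differently.

```python
def rollout_long_video(generate_chunk, first_frame, actions, chunk_len):
    """Autoregressively generate one frame per action.

    generate_chunk(cond_frame, action_chunk) is a hypothetical stand-in for the
    world model: it returns len(action_chunk) frames given a conditioning frame.
    """
    frames = [first_frame]
    for start in range(0, len(actions), chunk_len):
        action_chunk = actions[start:start + chunk_len]
        # Condition on the most recent frame, which is a generated frame
        # for every chunk after the first.
        frames.extend(generate_chunk(frames[-1], action_chunk))
    return frames


def toy_model(cond_frame, action_chunk):
    """Toy 'model': a frame is a number and each action is added to it."""
    out, current = [], cond_frame
    for action in action_chunk:
        current = current + action
        out.append(current)
    return out


if __name__ == "__main__":
    video = rollout_long_video(toy_model, first_frame=0, actions=[1] * 10, chunk_len=4)
    print(len(video))
```

Because each chunk starts from a generated frame, errors can accumulate over the rollout, which is one reason the long trajectory setting is evaluated separately from the short one.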
